Hultig Crawler


Text crawler based on Scrapy and BeautifulSoup4, with MySQL as the backend


Features






HultigCrawler is a text crawler that recursively crawls all the text from a given website. The crawled data is saved as items, each with four fields: URL, Title, Tags and Text. The items are then written to the database through Scrapy pipelines.
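The item-to-database step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the class name `SqlitePagePipeline`, the `pages` table, and the lowercase field names are hypothetical, Tags is assumed to be a list of strings, and the standard library's `sqlite3` stands in for MySQL so the example runs without a database server. The method names `open_spider`, `process_item` and `close_spider` follow Scrapy's documented item pipeline interface.

```python
import sqlite3


class SqlitePagePipeline:
    """Scrapy-style item pipeline (hypothetical name).

    Scrapy calls open_spider, process_item and close_spider by name,
    so the class needs no Scrapy base class. sqlite3 is a stand-in
    for MySQL here; this is an assumption for illustration.
    """

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, title TEXT, tags TEXT, text TEXT)"
        )

    def process_item(self, item, spider):
        # Store one crawled page; a page crawled twice overwrites its old row.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title, tags, text) "
            "VALUES (?, ?, ?, ?)",
            (item["url"], item["title"], ",".join(item["tags"]), item["text"]),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.conn.close()


# Usage outside Scrapy, with a hand-made item (spider is unused here).
pipeline = SqlitePagePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"url": "https://example.com/", "title": "Example",
     "tags": ["demo"], "text": "Hello"},
    spider=None,
)
stored = pipeline.conn.execute("SELECT url, title FROM pages").fetchall()
pipeline.close_spider(spider=None)
```

In a real Scrapy project a class like this would be enabled through the `ITEM_PIPELINES` setting, and the connection would point at the MySQL server instead.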



About

It is written in Python. It uses BeautifulSoup4 for HTML parsing, Scrapy for crawling, and a few other supporting libraries. MySQL is the database used to store the crawled data.
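To give a rough idea of what the HTML parsing step has to produce, here is a small sketch that pulls the title, the visible text and the outgoing links from one page. It deliberately uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup4, so it is not how the project itself parses pages. The names `PageExtractor` and `extract_page` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects title, visible text and links (stand-in for BeautifulSoup4)."""

    SKIPPED = {"script", "style"}  # content that is not readable text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links so they can be crawled next.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif not self._skip_depth:
            self.chunks.append(text)


def extract_page(url, html):
    """Return the page's title, text and absolute links as a dict."""
    parser = PageExtractor(url)
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title,
        "text": " ".join(parser.chunks),
        "links": parser.links,
    }
```

A recursive crawl would call something like `extract_page` on each fetched page and queue the returned links for the next round; in the actual project Scrapy handles that scheduling.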

The crawler can be used to crawl text from most sites. Instructions for using the program are in the README.md file. This is the initial version of the program, with many improvements on the way. Any kind of feedback is welcome.

Our Team

João Paulo Cordeiro

Assistant Professor (UBI)

Natural Language Processing

Automatic Text Summarization

Sebastião Pais

Assistant Professor (UBI)

Statistical Natural Language Processing

Lexical Semantics

M. Luqman Jamil

MS Student (UBI)

IT intern

Natural Language Processing

Contact

Phone
+351 275 242 081